Skip to content

Enable exporting monthly invoice to Iceberg#292

Open
QuanMPhm wants to merge 2 commits into
CCI-MOC:mainfrom
QuanMPhm:259/export_iceberg
Open

Enable exporting monthly invoice to Iceberg#292
QuanMPhm wants to merge 2 commits into
CCI-MOC:mainfrom
QuanMPhm:259/export_iceberg

Conversation

@QuanMPhm

Copy link
Copy Markdown
Contributor

Closes #259. This PR is in draft, since I'm not too clear which dataframe we want to export. Also, I want consensus on the datatypes of the columns we want in the iceberg table. It seems this PR will depend on #285.

@knikolla @naved001 Regardless, for now, this draft should be enough to show you guys what the iceberg operations involved will look like, and the structure of code changes that will need to happen

@jimmysway

Copy link
Copy Markdown
Contributor

Aren't the dataframes that we'll be exporting just the processed billable and non-billable dataframes?

@jimmysway

Copy link
Copy Markdown
Contributor

Is Billable, Cluster Assignment Source, Cluster Assignment Method

Are three new columns that have been created from the backfilling process. going forward. We can just set Cluster Assignment Source and Cluster Assignment method to Something like "Regular Monthly Invoicing Cycle" or something like that.

There is already an Is Billable column or flag or indicator somewhere in the invoicing code it is just a matter of exporting that as well instead of separating billable from non-billable. I think the plan was to ideally export the entire combined dataframe (billable + nonbillable) to the iceberg.

@QuanMPhm @knikolla

@QuanMPhm

QuanMPhm commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

We can just set Cluster Assignment Source and Cluster Assignment method to Something like "Regular Monthly Invoicing Cycle" or something like that.

I am hesistant to include those columns in the Iceberg table. I believe they should just be removed, since they're not useful for anything in invoicing. We had those columns in the first place because it helped in constructing the initial iceberg table.

As for what invoice to export, I fine with exporting the entire "main" dataframe, since that's equivalent to billable + nonbillable, and it is also what's currently in the initial iceberg table.

@knikolla @naved001 I'll wait for your review

@knikolla

knikolla commented May 7, 2026

Copy link
Copy Markdown
Contributor

We shouldn't have Cluster Assignment Source, Cluster Assignment Method in the Iceberg table

Is Billable should be there.

During 2026-03 invoicing, a bug was found where the columns initialized
by the New-PI credit processor (i.e `PI Balance` column), was
being accessed by the PI-SU processor before it was
initialized, causing an KeyError.

To fix this, the codebase has been refactored to allow each processor to
explicitly document which columns they initialize and use, defined in two
new properties, `initializes_columns` and `operates_on_columns`. A helper
function `_init_columns()` is added to initalize columns

Unit test `tests/unit/processors/test_processor_list.py` is added to check each processor
only uses columns that itself or previous processors initialized, and no
column is initialized more than once

Additionally, each column will now be encapsulated as a `InvoiceColumn` instance.
`InvoiceColumn` contains the name, datatype, and default values for each column
This will also enable stricter and clearer type enforcement for data entering
and leaving the pipeline

A new processor `ValidateInputColumnsProcessor` is added to check the input
dataframe to the processing pipeline has prerequisite columns, and to cast
to appropriate types

The e2e test data has been updated to surface the bug that was found.
It did not failed during the PR that introduced the bug [1] because
the test data didn't have the right conditions to trigger the PI-SU processor

Refactored unit tests to accomodate the new
processor by adding a new base test class.

[1] CCI-MOC#279
@QuanMPhm QuanMPhm force-pushed the 259/export_iceberg branch from ad4efbf to bfe5338 Compare June 4, 2026 16:37
@QuanMPhm QuanMPhm marked this pull request as ready for review June 4, 2026 16:37
@QuanMPhm QuanMPhm requested review from jimmysway and larsks June 4, 2026 16:37
@QuanMPhm

QuanMPhm commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

@knikolla @naved001 @jimmysway I now consider this PR ready for review. I have rebased it on top of the column tracking PR, as the current state of the e2e test won't allow the invoice to be exported to iceberg easily. Specifically, Iceberg does not allow exporting of columns with a pa.null type, which happens when a column only contains None. The current e2e test does cause some columns to only be None. Since the column tracking PR ensures every column has a non-null type, the issue is removed.

Added new invoice `IcebergInvoice` to export invoice data to Iceberg tables
The export process also includes a schema update step to
allow updates to Iceberg table schema.

New Iceberg integration test added to validate iceberg functionality
E2E test updated to include iceberg exporting
Both tests use a temporary sqlite catalog
@QuanMPhm QuanMPhm force-pushed the 259/export_iceberg branch from bfe5338 to bfbb760 Compare June 4, 2026 16:49
@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@knikolla @naved001 @jimmysway I just learned that pyiceberg does not support interfacing with hadoop catalogs, which is what our current iceberg table in s3 is using. So I'll need to rewrite my PR to use something like PySpark.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Append dataframe to Iceberg during invoicing pipeline run

3 participants